Node similarity as a basic principle behind 
connectivity in complex networks

How are people linked in a highly connected society? Since a power-law
(scale-free) node-degree distribution can be observed in many networks,
the power law might be seen as a universal characteristic of networks.
But this study of communication in the Flickr online social network
reveals that power-law node-degree distributions are restricted to
sparsely connected networks. More densely connected networks, by
contrast, show an increasing divergence from the power law. This work
shows that this observation is consistent with the classic idea from the
social sciences that similarity is the driving factor behind
communication in social networks. The strong relation between
communication strength and node similarity is confirmed by analyzing
the Flickr network. It is also shown that node similarity as a network
formation model can reproduce the characteristics of different network
densities and hence can be used as a model for describing the
topological transition from weakly to strongly connected societies.

In an increasingly interconnected world, it must be of huge interest to
understand the topology of a highly connected society; important, for
example, for predicting the spread of epidemic diseases [1]. A basic
measure to describe network topologies is the distribution of the number
of links per network node. Many real networks show a node-degree
distribution that approximately follows a power-law, a right-skewed
heavy-tailed distribution also known as scale-free distribution [2]. But
other real networks show also a truncated power-law or even an
exponentially shaped node-degree distribution [3, 4, 5].

To investigate network topologies it is essential to understand the
basic principles behind connectivity or, more precisely, the process of
network formation. Most of the current models are basically focused on
reproducing a power-law (scale-free) network topology. The most popular
model is a network growth model based on the idea of preferential
attachment: new nodes prefer to link to existing highly connected nodes
[6, 2]. But a high node-degree may rather be the result than the cause
of connectivity as shown by other models of network formation, including
the node copying model [7] and the fitness model [8, 9]. However, most
models reproduce quite well a power-law distribution, but do not explain
sufficiently the above mentioned observed divergences from power-law
which will be focused on in this work.

However, social sciences have a long history in
explaining social communication and interaction 
and a huge amount of literature from this field 
suggests that similarity is the major factor for 
connectivity in social networks as, for example, 
reviewed by McPherson et al. [10]. People tend 
to associate with those sharing similar interests, 
tastes, beliefs, social backgrounds, and also sim- 
ilar popularity. This is often expressed by the 
adage 'Birds of a feather flock together'. 
Recent analysis of mobile phone data further con- 
firms that communication is strongly related to 
geographic distance. There is a higher chance of 
people calling each other if they live closer to each 
other (similar location). Thus, the total amount 
of communication between two cities depends on 
their distance and population size, which can be 
well described by a gravitation model [11, 12]. 
In biology, interactions between proteins or other
molecules require an exact fit or complementarity
of their complex surfaces, which can be treated as
synonymous with similarity in the context of
connectivity.

For communication and interaction, space and 
time are often the dominant factors. 'To be in
the right place at the right time' often works as
the basic principle for getting connected, but
besides fitting in space and time, additional
properties are important: for instance, similar
surfaces of molecules, or similar interests of people. In
mobile phone networks, it can be shown that other
factors besides geographic distance influence com- 
munication, e.g., language [11]. Such additional 
factors become even more important in virtual 
communities in which geographic distance does 
not matter and written communication does not 
require the presence of the networked partner at 
the same time. 

In information networks, location and time are 
also not the dominant factors. In general, arti- 
cles are linked because of similar topics, scientific 
citations have a strong relevance to the author's 
work, and websites are mostly linked to websites 
of similar content [13]. 

Online social networks are an ideal source for investigating complex
networks because of the often huge number of users, their link and
communication profiles, and the availability of additional metadata such
as tags (keywords). Several recent studies confirm the impact of
similarity on links in social online networks by analyzing tag (keyword)
metadata between users [14, 15, 16]. But most studies focus on an
unweighted contact (declared friends) network structure. By contrast,
this study analyzes communication strength between users. This provides
a more precise description of user interactions in terms of weighted
links or contact intensities, which is useful for analyzing the
transition from sparsely connected to densely connected networks. In the
first step, by analyzing the Flickr® social online network, this study
shows that communication strength is directly related to tag (keyword)
similarity. In the second step, the Flickr® network is used to analyze
different network densities. It turns out that more densely connected
networks show an increasing divergence from the power-law distribution.
This characteristic can be reproduced by a network formation model based
on similarity, as shown by the Euclidean distance model proposed in this
work.

Figure 1: Similarity and communication in the Flickr® social network.
For each pair of a randomly chosen set of 10,000 Flickr® users, the
number of identically used keywords (tags) is set into relation with the
pair-wise communication strengths. The histogram shows the mean
communication strength of all pairs within intervals of 100 keywords.
This clearly confirms that a higher number of identical keywords is
strongly related to higher communication.

Similarity in the Flickr network

Flickr® is an online photo sharing commu- 
nity. Here, we analyze how users inter- 
act and communicate by commenting on pho- 
tos of other users. Data were collected in 
2009 by using the application programming 
interface (API) to the Flickr® database at 
http://www.flickr.com/services/api/.
The number of comments from one user A to another
user B is used to define the strength of communication,
and hence gives the weight of the link
between A and B. Similarity is measured by comparing
the keywords (tags) that people use to describe
their photos. People who use the same keywords
are assumed to have similar photographic
interests which, in turn, may lead to communication.

Setting the number of identical keywords into re- 
lation with the number of comments between two 
individuals, as shown in Figure 1, reveals a clear 
dependency between similarity and communica- 
tion strength. The intensity of communication 
between two individuals is strongly related to the 
number of identically used keywords, thereby con- 
firming empirically that communication strength 
depends on similarity between individuals. 
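
The binning behind Figure 1 can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used for the study: the input formats (`user_tags`, `comments`) and the function name are hypothetical, and summing the comments in both directions to obtain one pairwise strength is an assumption made here for simplicity.

```python
from collections import defaultdict
from itertools import combinations


def mean_comments_by_shared_tags(user_tags, comments, bin_width=100):
    """Mean pairwise communication strength per interval of shared tags.

    user_tags: dict mapping user -> set of tags (hypothetical format)
    comments:  dict mapping (commenter, owner) -> number of comments
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for a, b in combinations(sorted(user_tags), 2):
        # Similarity: number of identically used keywords.
        shared = len(user_tags[a] & user_tags[b])
        # Assumption: pair strength = comments in both directions.
        strength = comments.get((a, b), 0) + comments.get((b, a), 0)
        bin_start = (shared // bin_width) * bin_width
        totals[bin_start] += strength
        counts[bin_start] += 1
    # Mean strength of all pairs falling into each interval.
    return {b: totals[b] / counts[b] for b in sorted(counts)}
```

With `bin_width=100` the keys of the returned dictionary are the lower bounds of the 100-keyword intervals described in the Figure 1 caption.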

From sparse to dense networks 

In order to investigate how node-degree distributions depend on network
density, the difference between sparsely and densely connected
topologies is analyzed. Since most networks are rather sparsely
connected, including the Flickr® network as a whole, a more densely
connected subset of Flickr® is chosen as an example: the Flickr group
'Light Painters Society' (id:1066685@N25), which has 6,036 members
(nodes). By using different thresholds for the number of comments to be
accepted as a link, the degree of overall connectivity can be varied
from sparsely to densely connected networks.

Figure 2 shows the in-degree distribution counting only strong links (at
least 20 comments), medium-weighted links (at least 2, 3, or 6
comments), and all links, including very weak ones (at least one
comment). It reveals that only a sparsely connected network shows the
typical scale-free, power-law-like distribution. Densely connected
networks, by contrast, show a distribution which is very distinct from a
power law.
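
The thresholding step can be illustrated as follows. This is a minimal sketch that assumes the comment counts are available as a square matrix `weights` with `weights[i, j]` holding the number of comments from user i to user j; the function names and the particular logarithmic binning scheme are assumptions, not taken from the original analysis.

```python
import numpy as np


def in_degrees(weights, min_comments):
    """In-degree of each node when a link needs >= min_comments comments."""
    w = np.asarray(weights)
    links = w >= min_comments
    np.fill_diagonal(links, False)  # ignore comments on one's own photos
    return links.sum(axis=0)        # column sums = incoming links


def log_binned_distribution(degrees, bins_per_decade=5):
    """Noise-reduced degree distribution on logarithmically spaced bins."""
    deg = np.asarray(degrees)
    deg = deg[deg > 0]              # degree 0 cannot be shown on a log axis
    if deg.size == 0:
        return np.array([]), np.array([])
    upper = np.log10(deg.max() + 1)
    n_bins = max(1, int(np.ceil(upper * bins_per_decade)))
    edges = np.logspace(0, upper, n_bins + 1)
    counts, _ = np.histogram(deg, bins=edges)
    centers = np.sqrt(edges[:-1] * edges[1:])        # geometric bin centers
    density = counts / (np.diff(edges) * deg.size)   # normalize by bin width
    return centers, density
```

Calling `in_degrees` with a high `min_comments` keeps only strong links and yields a sparse network; lowering it toward 1 adds the weak links and makes the network denser.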

The node similarity model 

The observed characteristics of real networks can be reproduced by a
simple similarity model based on Euclidean distance in pure random data.
This is demonstrated by artificially generating a network from a
100 x 8000 normally distributed random data matrix X, corresponding to
m = 100 properties and N = 8000 network nodes. Two nodes $x_i$ and $x_j$
are defined as connected if their Euclidean distance
$d = \lVert x_i - x_j \rVert$ is below a certain threshold. Increasing
this threshold means changing the network density from sparsely to
densely connected. As shown in Figure 3, a similarity model generates
the same shapes of node-degree distributions as observed in the real
network (Figure 2).
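
The model described above can be sketched in Python with NumPy. This is an illustrative re-implementation under stated assumptions, not the MATLAB® code referred to in the text: the function name, the default threshold, and the smaller default number of nodes (chosen to keep memory use modest) are all assumptions, and the data matrix is stored with nodes as rows rather than columns.

```python
import numpy as np


def similarity_network_degrees(n_nodes=2000, n_props=100,
                               threshold=12.0, seed=0):
    """Node degrees of a network linking nodes closer than `threshold`."""
    rng = np.random.default_rng(seed)
    # Pure random data: one row per node, one column per property.
    x = rng.standard_normal((n_nodes, n_props))
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.einsum("ij,ij->i", x, x)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(sq_dist, 0.0, out=sq_dist)   # clip rounding noise below zero
    # Two nodes are connected if their distance is below the threshold.
    adjacency = sq_dist < threshold ** 2
    np.fill_diagonal(adjacency, False)      # no self-links
    return adjacency.sum(axis=1)
```

For standard normal data with m = 100 properties, typical pairwise distances lie around sqrt(2m), roughly 14, so thresholds well below that value give sparse networks and thresholds above it give dense ones.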

A MATLAB® implementation of the similarity
model is available at:

http://www.network-science.org/similarity-model.html.

Benefits of a similarity based model 

Besides the relation shown between similarity and
connectivity strength, there are a number of other
points that show that a similarity model is an appropriate
and natural way to describe real complex
networks:

a) Most network formation models are developed 
to reproduce only power-law distributions such 
as in Figure 2A. Thus, they cannot explain node- 
degree distributions distinct from a power-law as 
in Figure 2 C to E. A similarity model, by con- 
trast, covers naturally the full observed diversity 
from power-law to non-power-law distributions. 

b) A similarity model does not depend on dynamics in network size such
as an increase or decrease in the total number of network nodes. It
therefore works in situations of network growth as well as shrinkage, or
even for pure reorganization of links in a network of constant size.
Since power-law-like distributions can also be reproduced by the
similarity model, the observed power-law characteristics of real
networks are not necessarily a result of network growth.

Figure 2: Node-degree distributions of the Flickr® online network.
Distributions of a densely connected Flickr group are plotted for
different connectivity levels on log-log scales. Logarithmic binning is
used to show noise-reduced distributions (red line). Two individuals are
defined to be linked when the number of their comments exceeds a certain
threshold. Different thresholds lead to networks that differ in their
overall connectivity level. (A) Counting only strong links with more
than 20 comments leads to a sparsely connected network showing the
typical scale-free power-law distribution. (B, C, and D) Moderate
thresholds lead to the often observed saturation effects in lower
node-degrees: the number of nodes of low degree is smaller than expected
for a scale-free power-law topology. (E) A densely connected network
(counting also very weak links of only one comment) no longer follows a
power law.

Figure 3: The similarity model. Based on a random data set, two nodes
are defined as connected (similar) when their Euclidean distance d is
below a certain threshold. The distributions, plotted on log-log scales,
depend on network density. (A) With a strong threshold only very similar
nodes are connected. This represents a sparsely connected network
showing the typical scale-free, power-law-like distribution. (B-E) In
increasingly connected networks, as given by weaker thresholds, the
number of nodes having a small degree decreases. Thus, a similarity
model is able to reproduce the same diversity of distributions, from
sparsely to densely connected networks, as found in the real network
(Figure 2).

c) Because of the usually undirected property of 
similarity, it is a natural model for undirected net- 
works in which connections are induced from both 
sides as in social networks. But similarity can also 
be used in a directed manner when additional fac- 
tors such as time in a growth model (e.g., citation 
network) enforce directed relations. 

d) Similarity does not require global knowledge, such as the node-degree
of all network nodes. Similarity refers only to the local environment of
people in real physical as well as in virtual communication worlds.
People who live in the same place, engage in similar activities, or
belong to the same online communities meet each other and connect
according to their similar behaviors and interests; global knowledge
about all people is not necessary.

e) A similarity model explains the topological transition from sparsely
to densely or even completely connected networks, which a pure power-law
model does not. Completely connected networks, in which each node is
connected to every other node, do not follow a power-law distribution;
instead, all N nodes have the same maximum degree, k = N - 1. Thus, with
increasing connectivity there must be a transition from the power-law
topologies (Fig. 3A) of sparsely connected networks to the peaked
distributions (Fig. 4E) of completely connected networks. A similarity
model explains the transition from sparsely to densely connected
networks as shown in Figure 3 and, in addition, to completely connected
networks (Fig. 4). For almost completely connected networks the
similarity model predicts a left-skewed distribution, inverse to the
power law, in which most nodes have a high degree and only a few nodes
have a low degree.
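
The limiting case in point e) can be checked numerically with a small sweep over link densities. The sketch below is a hedged illustration rather than the procedure behind Figure 4: the function name is hypothetical, and choosing each threshold so that a given fraction of all node pairs becomes linked is an assumption made here for convenience.

```python
import numpy as np


def degree_sweep(n_nodes=500, n_props=100,
                 fractions=(0.001, 0.01, 0.1, 0.5, 0.9, 1.0), seed=0):
    """Node degrees for thresholds that link a given fraction of pairs."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_nodes, n_props))
    gram = x @ x.T
    sq = np.diag(gram)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * gram, 0.0))
    dist = np.minimum(dist, dist.T)          # enforce exact symmetry
    pair_sorted = np.sort(dist[np.triu_indices(n_nodes, k=1)])
    degrees = {}
    for f in fractions:
        # Threshold = distance of the pair at rank f among all pairs.
        idx = max(int(np.ceil(f * pair_sorted.size)) - 1, 0)
        adjacency = dist <= pair_sorted[idx]
        np.fill_diagonal(adjacency, False)   # no self-links
        degrees[f] = adjacency.sum(axis=1)
    return degrees
```

At a fraction of 1.0 the threshold equals the largest pairwise distance, so every node reaches the maximum degree k = N - 1; at smaller fractions the mean degree shrinks accordingly.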

Conclusion 

This work demonstrates that the frequently observed scale-free power-law
distribution can be well reproduced by a model which is purely based on
the idea of node similarity. Since similarity is independent of dynamics
in network size such as growth or shrinkage, the observed power law of
real networks is not necessarily caused by the growth of networks. In
addition, a similarity model shows that the frequently observed
distributions distinct from a power law are a characteristic of more
densely connected networks. This means that the differences we can
observe in node-degree distributions of real networks are mainly given
by their overall link density: whereas the typical sparsely connected
networks show power-law distributions, densely connected networks show
non-power-law distributions. This can be further extended to almost
completely connected networks as can be found in a family or a small
village in which everyone knows everyone else. While in sparsely
connected power-law networks most nodes have a low number of links and
only a few are highly linked, almost completely connected networks show
the opposite: most nodes have a high or even maximum degree and only a
few nodes have lower degrees. These less connected nodes may represent
outsiders in an almost completely connected clique. Since a similarity
model explains the entire topological transition from sparsely to
densely connected networks, it is able to explain the transition from
weakly connected to highly connected societies.

Figure 4: Topological changes from densely to almost completely
connected networks. Plotted are the node-degree distributions of
increasingly connected networks generated by the similarity model (A-E).
In almost completely connected networks (D and E) the node-degree
distribution appears as an inverse power law: most nodes have a high
degree whereas only few nodes have a low degree.